Why a safety-trained model can still be steered off its rails — and what that tells a defender
Day 21 of 60
For four weeks you've built the machinery of behaving-well-by-default: taxonomies, policies, red-teams, evals. This week confronts the thing that machinery can't fully fix. A model can be helpful, well-policied, and safety-tuned — and still collapse under inputs engineered to break it. That's the week's thesis: capability is not robustness, and a behavior the model exhibits 99.9% of the time can be reliably defeated by an adversary who only needs it to fail once.
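To see why "fails once in a thousand" is not reassuring, run the arithmetic. The sketch below assumes every attempt is an independent trial with the same 99.9% success rate. That is a simplification: a real adversary adapts between attempts, so the true picture is likely worse. The function name is ours for illustration, not part of any library.

```python
def p_at_least_one_failure(per_attempt_success: float, attempts: int) -> float:
    """Chance the behavior fails at least once across `attempts` tries.

    ASSUMPTION: attempts are independent and share one success rate.
    """
    return 1.0 - per_attempt_success ** attempts


if __name__ == "__main__":
    for n in (1, 100, 1_000, 10_000):
        p = p_at_least_one_failure(0.999, n)
        print(f"{n:>6} attempts: P(at least one failure) = {p:.3f}")
```

Under these assumptions, the chance of at least one failure is already roughly 63% at 1,000 attempts and all but certain at 10,000. The defender needs the behavior to hold every time; the attacker needs a single miss.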
Safety training shifts a model's default behavior; it does not install a guarantee. A jailbreak is any input that moves the model off that default into disallowed territory. We study how they work only to defend — to know which layer catches which attack, where the gaps are, and why no single safeguard is enough.
A note on framing before we go further: everything below is taught at the mechanism level. You will not find working strings, suffixes, or recipes here, and you don't need them. A defender's edge comes from understanding why a class of attack works, not from being able to run one.
The most useful mental model comes from "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023). It names two failure modes that explain almost every jailbreak you'll ever see.
The first is competing objectives. The model is trained to be helpful and to be harmless at the same time. An attacker constructs a context where those two goals collide: being helpful, following instructions, or staying in character pulls against the refusal. When the helpful objective wins the tug-of-war, you get a jailbreak. The safety behavior wasn't deleted; it was out-pulled.
The second is mismatched generalization. Safety tuning generalizes only as far as the distribution of inputs it was trained on, while the model's underlying capabilities reach much further. Push the input far enough outside that distribution (unusual encodings, rare framings, formats the safety data never covered) and the harmlessness behavior simply doesn't fire, because the model doesn't recognize the situation as the kind it was taught to refuse.
Competing objectives = the safety behavior is present but overpowered. Mismatched generalization = the safety behavior never activates because the input looks unfamiliar. Almost every jailbreak family is one of these two, or both. Naming which one you're looking at is the first move of a defender.
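One way to make "naming which one you're looking at" a habit is to record the diagnosis as structured data in your red-team notes. The sketch below is a minimal illustration, not an established tool: the type, the function, and its two yes/no questions are our assumptions. It describes findings by mechanism only and never stores a payload.

```python
from enum import Flag, auto


class FailureMode(Flag):
    COMPETING_OBJECTIVES = auto()       # safety behavior present but overpowered
    MISMATCHED_GENERALIZATION = auto()  # safety behavior never activated


def diagnose(refusal_was_outpulled: bool, input_looked_unfamiliar: bool) -> FailureMode:
    """Map a reviewer's two judgments to failure-mode tags.

    ASSUMPTION: both arguments are human judgments made while reviewing
    a finding; nothing here inspects or classifies model inputs.
    """
    modes = FailureMode(0)  # empty flag: neither mode identified yet
    if refusal_was_outpulled:
        modes |= FailureMode.COMPETING_OBJECTIVES
    if input_looked_unfamiliar:
        modes |= FailureMode.MISMATCHED_GENERALIZATION
    return modes


if __name__ == "__main__":
    both = diagnose(refusal_was_outpulled=True, input_looked_unfamiliar=True)
    print(FailureMode.COMPETING_OBJECTIVES in both)       # membership check
    print(FailureMode.MISMATCHED_GENERALIZATION in both)
```

A flag type fits here because the two modes are not exclusive: a single finding can carry both tags, matching the "one of these two, or both" framing.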
The other landmark read is "Universal and Transferable Adversarial Attacks on Aligned Language Models" (Zou et al., 2023), the GCG paper. Read it strictly as a defender: the result that matters is not any particular string but the shape of the finding. Attacks can be found automatically by optimization rather than human cleverness, and a suffix optimized against one open model can transfer to others it was never tuned on.
If attacks are machine-discoverable and transferable, then "we patched the ones we found" is not a defense: an optimizer can keep producing new attacks from a space far too large to enumerate, and the same attack may work on models it was never aimed at. This is the argument for treating robustness as a property of the whole stack, not of the safety-tuning layer alone. Hold that thought; Day 23 makes it concrete.
Three attack families recur through the rest of the week, so anchor them now: direct jailbreaks (the user crafts the abusive input themselves), prompt injection (instructions arrive through content the model processes; this is tomorrow's topic, and the scary one for agents), and multimodal evasion (the abusive instruction hides in an image, audio, or other non-text channel a text-only filter never sees). Each maps to different defensive layers, which is exactly why a single safeguard can't cover all three.
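The coverage argument can be made concrete with a small table of which layer sees which family. Everything in the mapping below is an assumption made for illustration: the layer names are hypothetical, and each layer is deliberately drawn as narrowly scoped (the text filter, for instance, is assumed to see only the user's typed message). Your own stack will look different; the point is the gap check, not the specific entries.

```python
from enum import Enum


class AttackFamily(Enum):
    DIRECT_JAILBREAK = "direct jailbreak"
    PROMPT_INJECTION = "prompt injection"
    MULTIMODAL_EVASION = "multimodal evasion"


# ASSUMPTION: hypothetical layers and coverage, chosen only to illustrate.
LAYER_COVERAGE: dict[str, set[AttackFamily]] = {
    "user_text_filter": {AttackFamily.DIRECT_JAILBREAK},
    "untrusted_content_isolation": {AttackFamily.PROMPT_INJECTION},
    "non_text_input_classifier": {AttackFamily.MULTIMODAL_EVASION},
}


def uncovered(deployed_layers: set[str]) -> set[AttackFamily]:
    """Return the attack families that no deployed layer is assumed to see."""
    covered: set[AttackFamily] = set()
    for layer in deployed_layers:
        covered |= LAYER_COVERAGE[layer]
    return set(AttackFamily) - covered


if __name__ == "__main__":
    gaps = uncovered({"user_text_filter"})
    print(sorted(family.value for family in gaps))
```

With only the text filter deployed, the check reports prompt injection and multimodal evasion as gaps. Coverage in this sketch means "this layer can see the channel", not "this layer always catches the attack", so an empty gap set is a starting point rather than a guarantee.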
A novice sees a jailbreak as a clever trick to collect. An expert sees it as a diagnosis: this input won because helpfulness out-pulled harmlessness, or because the input fell outside the safety distribution. The altitude jump is from cataloguing attacks to explaining them — because an explanation generalizes to attacks you haven't seen, and a catalogue doesn't.
Say this in an interview: "I think about jailbreaks through two lenses — competing objectives and mismatched generalization. That tells me safety tuning is necessary but not sufficient: it sets a default, not a guarantee. So I design for defense-in-depth and assume any single layer, including the model's own training, can be bypassed."